Credit Card Users Churn Prediction¶

Context¶

The Thera bank recently saw a steep decline in the number of users of its credit cards. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.

Objective¶

Customers leaving its credit card services would lead the bank to losses, so the bank wants to analyze its customer data, identify the customers who will leave their credit card services, and understand the reasons why, so that it can improve in those areas. As a Data Scientist at Thera Bank, you need to explore the data provided, identify patterns, build a classification model to identify customers likely to churn, and provide actionable insights and recommendations that will help the bank improve its services so that customers do not give up their credit cards.

Data Description¶

  • CLIENTNUM: Client number. Unique identifier for the customer holding the account
  • Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"
  • Customer_Age: Age in Years
  • Gender: The gender of the account holder
  • Dependent_count: Number of dependents
  • Education_Level: Educational Qualification of the account holder - Graduate, High School, Unknown, Uneducated, College(refers to a college student), Post-Graduate, Doctorate.
  • Marital_Status: Marital Status of the account holder
  • Income_Category: Annual Income Category of the account holder
  • Card_Category: Type of Card
  • Months_on_book: Period of relationship with the bank
  • Total_Relationship_Count: Total no. of products held by the customer
  • Months_Inactive_12_mon: No. of months inactive in the last 12 months
  • Contacts_Count_12_mon: No. of Contacts between the customer and bank in the last 12 months
  • Credit_Limit: Credit Limit on the Credit Card
  • Total_Revolving_Bal: The balance that carries over from one month to the next is the revolving balance
  • Avg_Open_To_Buy: Open to Buy refers to the amount left on the credit card to use (Average of last 12 months)
  • Total_Trans_Amt: Total Transaction Amount (Last 12 months)
  • Total_Trans_Ct: Total Transaction Count (Last 12 months)
  • Total_Ct_Chng_Q4_Q1: Ratio of the total transaction count in 4th quarter and the total transaction count in the 1st quarter
  • Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in 4th quarter and the total transaction amount in the 1st quarter
  • Avg_Utilization_Ratio: Represents how much of the available credit the customer spent

Exploratory Data Analysis¶

Import Library¶

In [201]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import SMOTE

# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    BaggingClassifier,
)
from xgboost import XGBClassifier

from sklearn.dummy import DummyClassifier

# To get different metric scores, and split data
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    ConfusionMatrixDisplay,
    RocCurveDisplay,
)

# To be used for data scaling and encoding
from sklearn.preprocessing import (
    StandardScaler,
    MinMaxScaler,
    OneHotEncoder,
    RobustScaler,
)
from sklearn.impute import SimpleImputer

# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
In [202]:
#Read the dataset
df = pd.read_csv('BankChurners.csv')
df.head()
Out[202]:
CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book ... Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
0 768805383 Existing Customer 45 M 3 High School Married $60K - $80K Blue 39 ... 1 3 12691.0 777 11914.0 1.335 1144 42 1.625 0.061
1 818770008 Existing Customer 49 F 5 Graduate Single Less than $40K Blue 44 ... 1 2 8256.0 864 7392.0 1.541 1291 33 3.714 0.105
2 713982108 Existing Customer 51 M 3 Graduate Married $80K - $120K Blue 36 ... 1 0 3418.0 0 3418.0 2.594 1887 20 2.333 0.000
3 769911858 Existing Customer 40 F 4 High School NaN Less than $40K Blue 34 ... 4 1 3313.0 2517 796.0 1.405 1171 20 2.333 0.760
4 709106358 Existing Customer 40 M 3 Uneducated Married $60K - $80K Blue 21 ... 1 0 4716.0 0 4716.0 2.175 816 28 2.500 0.000

5 rows × 21 columns

Shape of Dataframe¶

In [203]:
df.shape
Out[203]:
(10127, 21)
  • There are 10127 rows and 21 columns

Info of Dataframe¶

In [204]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   CLIENTNUM                 10127 non-null  int64  
 1   Attrition_Flag            10127 non-null  object 
 2   Customer_Age              10127 non-null  int64  
 3   Gender                    10127 non-null  object 
 4   Dependent_count           10127 non-null  int64  
 5   Education_Level           8608 non-null   object 
 6   Marital_Status            9378 non-null   object 
 7   Income_Category           10127 non-null  object 
 8   Card_Category             10127 non-null  object 
 9   Months_on_book            10127 non-null  int64  
 10  Total_Relationship_Count  10127 non-null  int64  
 11  Months_Inactive_12_mon    10127 non-null  int64  
 12  Contacts_Count_12_mon     10127 non-null  int64  
 13  Credit_Limit              10127 non-null  float64
 14  Total_Revolving_Bal       10127 non-null  int64  
 15  Avg_Open_To_Buy           10127 non-null  float64
 16  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 17  Total_Trans_Amt           10127 non-null  int64  
 18  Total_Trans_Ct            10127 non-null  int64  
 19  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 20  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB
  • There are 6 categorical columns; the remaining columns are all numerical
  • CLIENTNUM can be dropped, as a unique identifier does not help in identifying any patterns
In [205]:
df = df.drop(['CLIENTNUM'], axis = 1)
In [206]:
df.head()
Out[206]:
Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
0 Existing Customer 45 M 3 High School Married $60K - $80K Blue 39 5 1 3 12691.0 777 11914.0 1.335 1144 42 1.625 0.061
1 Existing Customer 49 F 5 Graduate Single Less than $40K Blue 44 6 1 2 8256.0 864 7392.0 1.541 1291 33 3.714 0.105
2 Existing Customer 51 M 3 Graduate Married $80K - $120K Blue 36 4 1 0 3418.0 0 3418.0 2.594 1887 20 2.333 0.000
3 Existing Customer 40 F 4 High School NaN Less than $40K Blue 34 3 4 1 3313.0 2517 796.0 1.405 1171 20 2.333 0.760
4 Existing Customer 40 M 3 Uneducated Married $60K - $80K Blue 21 5 1 0 4716.0 0 4716.0 2.175 816 28 2.500 0.000
In [207]:
df.describe()
Out[207]:
Customer_Age Dependent_count Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
count 10127.000000 10127.000000 10127.000000 10127.000000 10127.000000 10127.000000 10127.000000 10127.000000 10127.000000 10127.000000 10127.000000 10127.000000 10127.000000 10127.000000
mean 46.325960 2.346203 35.928409 3.812580 2.341167 2.455317 8631.953698 1162.814061 7469.139637 0.759941 4404.086304 64.858695 0.712222 0.274894
std 8.016814 1.298908 7.986416 1.554408 1.010622 1.106225 9088.776650 814.987335 9090.685324 0.219207 3397.129254 23.472570 0.238086 0.275691
min 26.000000 0.000000 13.000000 1.000000 0.000000 0.000000 1438.300000 0.000000 3.000000 0.000000 510.000000 10.000000 0.000000 0.000000
25% 41.000000 1.000000 31.000000 3.000000 2.000000 2.000000 2555.000000 359.000000 1324.500000 0.631000 2155.500000 45.000000 0.582000 0.023000
50% 46.000000 2.000000 36.000000 4.000000 2.000000 2.000000 4549.000000 1276.000000 3474.000000 0.736000 3899.000000 67.000000 0.702000 0.176000
75% 52.000000 3.000000 40.000000 5.000000 3.000000 3.000000 11067.500000 1784.000000 9859.000000 0.859000 4741.000000 81.000000 0.818000 0.503000
max 73.000000 5.000000 56.000000 6.000000 6.000000 6.000000 34516.000000 2517.000000 34516.000000 3.397000 18484.000000 139.000000 3.714000 0.999000
  • Age Range:

The age of customers in the dataset ranges from 26 to 73 years. The majority of customers fall between 41 and 52 years old, with the median age being 46. This indicates a relatively mature customer base.

  • Credit Limit:

Credit limits vary widely from as low as $1,438.30 to as high as $34,516.00. The median credit limit is $4,549.00, with the upper quartile at $11,067.50. This shows a substantial range in customers' available credit, suggesting that the dataset includes both lower and higher credit limits.

  • Transaction Activity:

Customers' total transaction amounts over the last 12 months vary from $510 to $18,484, with a median of $3,899. The upper quartile is $4,741, indicating that a sizeable portion of customers has relatively high transaction activity and that spending behavior varies significantly among customers.

  • Revolving Balance:

The total revolving balance, which ranges from $0 to $2,517, has a median of $1,276. Most customers have revolving balances that are on the lower end, with the 75th percentile at $1,784. This may reflect either conservative spending or effective credit management by many customers.

  • Utilization Ratio:

The average utilization ratio, which measures the proportion of available credit being used, ranges from 0.00 to 0.999. The median utilization ratio is 0.176, while the 75th percentile is 0.503. This indicates that while some customers use a significant portion of their available credit, many maintain relatively low utilization rates, which can be indicative of good credit management practices.
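One relationship worth checking: the df.head() output above suggests that Avg_Open_To_Buy equals Credit_Limit minus Total_Revolving_Bal. A minimal sketch verifying this on the five head rows (values copied from the output; on the full dataset the same check would use the df columns directly):

```python
import pandas as pd

# Values copied from the df.head() output above
sample = pd.DataFrame({
    "Credit_Limit":        [12691.0, 8256.0, 3418.0, 3313.0, 4716.0],
    "Total_Revolving_Bal": [777, 864, 0, 2517, 0],
    "Avg_Open_To_Buy":     [11914.0, 7392.0, 3418.0, 796.0, 4716.0],
})

# Open to Buy should be the unused portion of the credit limit
diff = sample["Credit_Limit"] - sample["Total_Revolving_Bal"]
print((diff == sample["Avg_Open_To_Buy"]).all())  # True for these rows
```

If this identity holds across the full dataset, Avg_Open_To_Buy is redundant given Credit_Limit and Total_Revolving_Bal, which matters later when interpreting correlations.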

In [208]:
df.describe(include = 'O')
Out[208]:
Attrition_Flag Gender Education_Level Marital_Status Income_Category Card_Category
count 10127 10127 8608 9378 10127 10127
unique 2 2 6 3 6 4
top Existing Customer F Graduate Married Less than $40K Blue
freq 8500 5358 3128 4687 3561 9436
  • The above shows the count, number of unique values, most frequent value (top), and its frequency for each categorical column

Duplicate values¶

In [209]:
#check if there are duplicates values
df.duplicated().sum()
Out[209]:
0
  • There are no duplicates

Univariate Analysis¶

Target Variable distribution¶

In [210]:
target_counts = df['Attrition_Flag'].value_counts()
# Plotting the pie chart
plt.figure(figsize=(8, 6))  # Optional: set figure size
plt.pie(target_counts, labels=target_counts.index, autopct='%1.1f%%', startangle=140, colors=plt.cm.Paired(range(len(target_counts))))
plt.title('Distribution of Target Variable')
plt.show()
In [211]:
print('Total number of customers:', df['Attrition_Flag'].count())
print(df['Attrition_Flag'].value_counts())
Total number of customers: 10127
Attrition_Flag
Existing Customer    8500
Attrited Customer    1627
Name: count, dtype: int64
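The counts above imply a clear class imbalance, which can be quantified directly (numbers taken from the value_counts output):

```python
existing, attrited = 8500, 1627  # from the value_counts output above
total = existing + attrited

churn_rate = attrited / total
imbalance_ratio = existing / attrited

print(f"Churn rate: {churn_rate:.1%}")              # 16.1%
print(f"Imbalance ratio: {imbalance_ratio:.1f}:1")  # 5.2:1
```

An imbalance of roughly 5:1 is why SMOTE is imported above and why recall and F1 will matter more than raw accuracy when evaluating models on this target.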

Categorical variables distribution¶

Gender¶

In [212]:
target_counts = df['Gender'].value_counts()
# Plotting the pie chart
plt.figure(figsize=(8, 6))  # Optional: set figure size
plt.pie(target_counts, labels=target_counts.index, autopct='%1.1f%%', startangle=140, colors=plt.cm.Paired(range(len(target_counts))))
plt.title('Gender')
plt.show()
In [213]:
print('Total count:', df['Gender'].count())
print(df['Gender'].value_counts())
Total count: 10127
Gender
F    5358
M    4769
Name: count, dtype: int64
  • There are more females than males in the dataset

Education level¶

In [214]:
# Create a count plot
sns.countplot(x='Education_Level', data=df)

# Customize the plot
plt.title('Education Levels')
plt.ylabel('Count')
plt.xlabel('Education Level')

# Show the plot
plt.show()
  • Graduates form the largest education group in the dataset

Marital Status¶

In [215]:
# Create a count plot
sns.countplot(x='Marital_Status', data=df)

# Customize the plot
plt.title('Marital Status')
plt.ylabel('Count')
plt.xlabel('Marital Status')

# Show the plot
plt.show()
  • Married customers form the largest group in the dataset

Income Category¶

In [216]:
# Customize the plot
plt.figure(figsize=(8, 6))
sns.countplot(x='Income_Category', data=df)
plt.title('Income Category')
plt.ylabel('Count')
plt.xlabel('Income Category')
# Rotate the x-axis labels to prevent overlap
plt.xticks(rotation=15)

# Show the plot
plt.show()
  • Most of the customers are in the 'Less than $40K' income category

Card Category¶

In [217]:
# Create a count plot
sns.countplot(x='Card_Category', data=df)

# Customize the plot
plt.title('Card Category')
plt.ylabel('Count')
plt.xlabel('Card Category')

# Show the plot
plt.show()
  • The vast majority of customers hold the 'Blue' card category
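The per-column count plots above can also be produced in a single loop instead of one cell each. A minimal sketch on toy data (the helper name plot_categorical_counts is illustrative, not from the original notebook):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

def plot_categorical_counts(frame: pd.DataFrame) -> list:
    """Draw a count plot for every object-dtype column; return the columns plotted."""
    cat_cols = frame.select_dtypes(include="object").columns.tolist()
    for col in cat_cols:
        plt.figure(figsize=(8, 4))
        sns.countplot(x=col, data=frame)
        plt.title(col.replace("_", " "))
        plt.xticks(rotation=15)
        plt.show()
    return cat_cols

# Toy frame mixing a categorical and a numerical column, like the notebook's df
toy = pd.DataFrame({"Gender": ["F", "M", "F"], "Customer_Age": [45, 49, 51]})
print(plot_categorical_counts(toy))  # ['Gender']
```

On the notebook's df this would cover Gender, Education_Level, Marital_Status, Income_Category, and Card_Category in one pass.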

BiVariate Analysis¶

Numerical variables by Target¶

In [218]:
# Customer Age
sns.displot( data=df,  x='Customer_Age',  hue='Attrition_Flag', bins=[26, 35, 45, 55, 75], multiple='stack')
plt.title('Customer Age')
plt.show()
In [219]:
# Dependent_count
sns.displot( data=df,  x='Dependent_count',  hue='Attrition_Flag', multiple='stack')
plt.title('Dependent count')
plt.show()
In [220]:
# Months_on_book
sns.displot( data=df,  x='Months_on_book',  hue='Attrition_Flag', bins=15, multiple='stack')
plt.title('Months on book')
plt.show()
In [221]:
#Total_Relationship_Count
sns.displot( data=df,  x='Total_Relationship_Count',  hue='Attrition_Flag', bins=15, multiple='stack')
plt.title('Total Relationship Count')
plt.show()
In [222]:
# Months_Inactive_12_mon
sns.displot( data=df,  x='Months_Inactive_12_mon',  hue='Attrition_Flag', bins=15, multiple='stack')
plt.title('Months Inactive 12_mon Count')
plt.show()
In [223]:
#Contacts_Count_12_mon
sns.displot( data=df,  x='Contacts_Count_12_mon',  hue='Attrition_Flag', bins=15, multiple='stack')
plt.title('Contacts 12_mon Count')
plt.show()
In [224]:
#Credit_Limit
sns.displot( data=df,  x='Credit_Limit',  hue='Attrition_Flag', bins=25, multiple='stack')
plt.title('Credit Limit')
plt.show()
In [225]:
#Total_Revolving_Bal
sns.displot( data=df,  x='Total_Revolving_Bal',  hue='Attrition_Flag', bins=15, multiple='stack')
plt.title('Total Revolving Bal')
plt.show()
In [226]:
#Avg_Open_To_Buy
sns.displot( data=df,  x='Avg_Open_To_Buy',  hue='Attrition_Flag', bins=15, multiple='stack')
plt.title('Avg Open To Buy')
plt.show()
In [227]:
#Total_Amt_Chng_Q4_Q1
sns.displot( data=df,  x='Total_Amt_Chng_Q4_Q1',  hue='Attrition_Flag', bins=15, multiple='stack')
plt.title('Total Amt Chng Q4_Q1')
plt.show()
In [228]:
#Total_Trans_Amt
sns.displot( data=df,  x='Total_Trans_Amt',  hue='Attrition_Flag', bins=15, multiple='stack')
plt.title('Total Trans Amt')
plt.show()
In [229]:
#Total_Trans_Ct
sns.displot( data=df,  x='Total_Trans_Ct',  hue='Attrition_Flag', bins=15, multiple='stack')
plt.title('Total Trans Ct')
plt.show()
In [230]:
#Total_Ct_Chng_Q4_Q1
sns.displot( data=df,  x='Total_Ct_Chng_Q4_Q1',  hue='Attrition_Flag', bins=15, multiple='stack')
plt.title('Total Ct Chng Q4_Q1')
plt.show()
In [231]:
#Avg_Utilization_Ratio
sns.displot( data=df,  x='Avg_Utilization_Ratio',  hue='Attrition_Flag', bins=15, multiple='stack')
plt.title('Avg Utilization Ratio')
plt.show()
In [232]:
plt.figure(figsize=(20, 20))
sns.set(palette="nipy_spectral")
sns.pairplot(data=df, hue="Attrition_Flag", corner=True)
Out[232]:
<seaborn.axisgrid.PairGrid at 0x148fa6710>
<Figure size 2000x2000 with 0 Axes>

Data Preprocessing¶

Outliers¶

In [233]:
# Get all numerical columns
numerical_columns = df.select_dtypes(include=['number'])
# Melt the DataFrame to long format for box plotting
df_melted = df.melt(id_vars='Attrition_Flag', value_vars=numerical_columns.columns.tolist(), var_name='Num Variables', value_name='Count')

# Create a box plot for all numerical columns
plt.figure(figsize=(20, 10))
sns.boxplot(x='Num Variables', y='Count', hue='Attrition_Flag', data=df_melted)
plt.xticks(rotation=15)
# Customize the plot
plt.title('Box Plot of All Numerical Columns by Category', fontsize=16, fontweight='semibold')
plt.xlabel('Num Variables')
plt.ylabel('Count')

# Show the plot
plt.show()
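The box plots flag outliers visually; the standard 1.5×IQR rule the plots use can also count them per column. A minimal sketch on toy data (count_iqr_outliers is an illustrative helper, not part of the original notebook):

```python
import pandas as pd

def count_iqr_outliers(s: pd.Series, k: float = 1.5) -> int:
    """Count points outside [Q1 - k*IQR, Q3 + k*IQR], the whisker rule box plots use."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return int(((s < q1 - k * iqr) | (s > q3 + k * iqr)).sum())

# Toy series with one obvious outlier at 500
s = pd.Series([10, 12, 11, 13, 12, 500])
print(count_iqr_outliers(s))  # 1
```

Applied column by column to the numerical features, this would make it easy to rank variables such as Credit_Limit and Total_Trans_Amt by how many outliers they contain.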

Correlation¶

In [234]:
cm = numerical_columns.corr()
plt.figure(figsize = (14, 10))
sns.heatmap(cm, annot = True, cmap = 'viridis')
Out[234]:
<Axes: >

Insights:¶

  • Strong Positive Correlation between Total_Trans_Ct and Total_Trans_Amt:

The highest correlation in the chart is between Total_Trans_Ct and Total_Trans_Amt with a value of 0.81. This suggests that the more transactions a customer makes, the higher their total transaction amount is.

  • Credit_Limit and Avg_Open_To_Buy:

Credit_Limit and Avg_Open_To_Buy are very strongly positively correlated, which is expected: Open to Buy is essentially the credit limit minus the revolving balance, so customers with higher credit limits necessarily have more available credit. One of the two columns carries little additional information.

  • Avg_Utilization_Ratio and Total_Revolving_Bal:

A moderate positive correlation (0.62) exists between Avg_Utilization_Ratio and Total_Revolving_Bal, suggesting that as the revolving balance increases, the credit utilization ratio also increases.

  • Credit_Limit and Avg_Utilization_Ratio:

There is a moderate negative correlation (-0.48) between Credit_Limit and Avg_Utilization_Ratio, indicating that customers with higher credit limits tend to have a lower utilization ratio.

  • Total_Trans_Amt and Avg_Open_To_Buy:

A weak positive correlation (0.17) exists between Total_Trans_Amt and Avg_Open_To_Buy, which suggests a slight relationship between how much a customer spends and how much available credit they have.
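Rather than reading pairs off the heatmap by eye, the strongest correlations can be extracted programmatically. A sketch on toy data (top_correlations is an illustrative helper, not from the original notebook):

```python
import numpy as np
import pandas as pd

def top_correlations(frame: pd.DataFrame, n: int = 3) -> pd.Series:
    """Return the n strongest absolute pairwise correlations, each pair listed once."""
    corr = frame.corr(numeric_only=True).abs()
    # Keep only the upper triangle (excluding the diagonal) so pairs are not duplicated
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    return corr.where(mask).stack().sort_values(ascending=False).head(n)

# Toy data: b is a noisy copy of a, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({"a": a, "b": a + 0.1 * rng.normal(size=200), "c": rng.normal(size=200)})
print(top_correlations(toy, n=1).index[0])  # ('a', 'b')
```

Run on the notebook's numerical_columns, this would surface the Total_Trans_Ct/Total_Trans_Amt and Credit_Limit/Avg_Open_To_Buy pairs discussed above without manual inspection.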

Null Values¶

In [235]:
df.isnull().sum()
Out[235]:
Attrition_Flag                 0
Customer_Age                   0
Gender                         0
Dependent_count                0
Education_Level             1519
Marital_Status               749
Income_Category                0
Card_Category                  0
Months_on_book                 0
Total_Relationship_Count       0
Months_Inactive_12_mon         0
Contacts_Count_12_mon          0
Credit_Limit                   0
Total_Revolving_Bal            0
Avg_Open_To_Buy                0
Total_Amt_Chng_Q4_Q1           0
Total_Trans_Amt                0
Total_Trans_Ct                 0
Total_Ct_Chng_Q4_Q1            0
Avg_Utilization_Ratio          0
dtype: int64
  • Education_Level (1519 missing) and Marital_Status (749 missing) need to be treated for missing values
In [236]:
df['Education_Level'] = df["Education_Level"].fillna('Unknown')
df['Marital_Status'] = df["Marital_Status"].fillna('Unknown')
In [237]:
df.isnull().sum()
Out[237]:
Attrition_Flag              0
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64
  • No more missing values
In [238]:
#Check for unique values
for column in df.columns:
    unique_values = df[column].unique()
    print(f"Column '{column}' has {len(unique_values)} unique values:")
    print(unique_values)
    print("\n")
Column 'Attrition_Flag' has 2 unique values:
['Existing Customer' 'Attrited Customer']


Column 'Customer_Age' has 45 unique values:
[45 49 51 40 44 32 37 48 42 65 56 35 57 41 61 47 62 54 59 63 53 58 55 66
 50 38 46 52 39 43 64 68 67 60 73 70 36 34 33 26 31 29 30 28 27]


Column 'Gender' has 2 unique values:
['M' 'F']


Column 'Dependent_count' has 6 unique values:
[3 5 4 2 0 1]


Column 'Education_Level' has 7 unique values:
['High School' 'Graduate' 'Uneducated' 'Unknown' 'College' 'Post-Graduate'
 'Doctorate']


Column 'Marital_Status' has 4 unique values:
['Married' 'Single' 'Unknown' 'Divorced']


Column 'Income_Category' has 6 unique values:
['$60K - $80K' 'Less than $40K' '$80K - $120K' '$40K - $60K' '$120K +'
 'abc']


Column 'Card_Category' has 4 unique values:
['Blue' 'Gold' 'Silver' 'Platinum']


Column 'Months_on_book' has 44 unique values:
[39 44 36 34 21 46 27 31 54 30 48 37 56 42 49 33 28 38 41 43 45 52 40 50
 35 47 32 20 29 25 53 24 55 23 22 26 13 51 19 15 17 18 16 14]


Column 'Total_Relationship_Count' has 6 unique values:
[5 6 4 3 2 1]


Column 'Months_Inactive_12_mon' has 7 unique values:
[1 4 2 3 6 0 5]


Column 'Contacts_Count_12_mon' has 7 unique values:
[3 2 0 1 4 5 6]


Column 'Credit_Limit' has 6205 unique values:
[12691.  8256.  3418. ...  5409.  5281. 10388.]


Column 'Total_Revolving_Bal' has 1974 unique values:
[ 777  864    0 ...  534  476 2241]


Column 'Avg_Open_To_Buy' has 6813 unique values:
[11914.  7392.  3418. ... 11831.  5409.  8427.]


Column 'Total_Amt_Chng_Q4_Q1' has 1158 unique values:
[1.335 1.541 2.594 ... 0.222 0.204 0.166]


Column 'Total_Trans_Amt' has 5033 unique values:
[ 1144  1291  1887 ... 10291  8395 10294]


Column 'Total_Trans_Ct' has 126 unique values:
[ 42  33  20  28  24  31  36  32  26  17  29  27  21  30  16  18  23  22
  40  38  25  43  37  19  35  15  41  57  12  14  34  44  13  47  10  39
  53  50  52  48  49  45  11  55  46  54  60  51  63  58  59  61  78  64
  65  62  67  66  56  69  71  75  74  76  84  82  88  68  70  73  86  72
  79  80  85  81  87  83  91  89  77 103  93  96  99  92  90  94  95  98
 100 102  97 101 104 105 106 107 109 118 108 122 113 112 111 127 114 124
 110 120 125 121 117 126 134 116 119 129 131 115 128 139 123 130 138 132]


Column 'Total_Ct_Chng_Q4_Q1' has 830 unique values:
[1.625 3.714 2.333 2.5   0.846 0.722 0.714 1.182 0.882 0.68  1.364 3.25
 2.    0.611 1.7   0.929 1.143 0.909 0.6   1.571 0.353 0.75  0.833 1.3
 1.    0.9   2.571 1.6   1.667 0.483 1.176 1.2   0.556 0.143 0.474 0.917
 1.333 0.588 0.8   1.923 0.25  0.364 1.417 1.083 1.25  0.5   1.154 0.733
 0.667 2.4   1.05  0.286 0.4   0.522 0.435 1.875 0.966 1.412 0.526 0.818
 1.8   1.636 2.182 0.619 0.933 1.222 0.304 0.727 0.385 1.5   0.789 0.542
 1.1   1.095 0.824 0.391 0.346 3.    1.056 1.118 0.786 0.625 1.533 0.382
 0.355 0.765 0.778 2.2   1.545 0.7   1.211 1.231 0.636 0.455 2.875 1.308
 0.467 1.909 0.571 0.812 2.429 0.706 2.167 0.263 0.429 2.286 0.828 1.467
 0.478 0.867 0.88  1.444 1.273 0.941 0.684 0.591 0.762 0.529 0.615 0.519
 0.421 0.947 1.167 1.105 0.737 1.263 0.538 1.071 0.357 0.407 0.923 1.455
 0.35  2.273 0.69  0.65  0.167 0.647 1.615 0.545 0.875 1.125 0.462 1.294
 1.357 3.5   1.067 1.286 0.524 1.214 0.273 1.538 0.783 0.235 0.607 2.083
 0.632 0.368 0.444 0.76  0.536 0.438 0.423 2.1   0.565 0.719 0.182 1.75
 0.944 0.581 0.333 0.643 0.87  0.692 1.227 0.938 1.833 0.652 1.462 0.583
 0.679 0.375 1.091 2.75  1.385 1.188 0.261 1.312 0.656 1.235 0.958 0.37
 0.059 0.3   0.613 1.778 0.955 0.864 1.429 0.889 1.438 0.481 0.452 1.13
 0.562 1.048 0.409 0.622 0.688 1.217 0.211 0.606 0.655 0.381 1.053 1.316
 0.575 0.85  0.41  0.609 1.579 0.56  0.276 0.533 0.515 0.308 0.852 0.371
 0.214 0.63  0.231 0.406 0.405 0.349 0.857 0.212 0.543 1.059 0.579 0.387
 0.724 0.415 0.895 0.781 0.412 0.649 0.32  0.345 0.367 0.586 0.324 0.306
 0.676 0.708 0.476 0.29  0.55  0.133 0.344 0.52  0.471 0.842 0.654 0.516
 0.464 1.857 0.629 0.963 0.686 0.323 0.585 0.633 0.92  0.441 0.424 0.59
 0.763 0.207 0.314 2.222 1.45  0.469 3.571 0.696 0.741 0.512 1.043 0.568
 0.548 0.194 0.552 0.448 0.651 0.393 0.657 0.682 0.808 1.032 0.577 0.241
 0.425 0.348 0.318 0.292 0.312 0.486 0.969 0.697 0.389 0.44  0.829 0.677
 0.189 0.259 0.72  0.815 1.15  0.806 0.537 0.721 0.531 0.472 0.594 0.773
 0.826 0.906 0.417 0.758 1.107 0.621 0.458 0.267 0.107 0.459 0.71  0.487
 0.95  0.321 0.414 0.742 0.739 0.767 0.394 0.091 0.926 0.618 0.784 0.208
 1.136 0.897 0.593 0.294 0.718 1.375 0.862 0.439 0.839 0.595 1.208 0.96
 0.514 0.433 0.484 1.08  0.931 0.233 0.971 0.957 1.038 0.48  0.731 1.474
 1.062 0.608 1.103 1.111 0.725 1.647 0.774 0.477 0.238 0.967 0.769 0.576
 0.567 1.042 0.759 0.81  1.069 0.574 0.528 0.278 0.703 0.447 0.028 0.297
 1.037 0.269 0.962 0.905 0.111 0.513 0.31  0.614 0.436 0.45  1.48  0.296
 0.879 1.114 0.262 1.278 0.257 0.517 1.36  0.605 1.04  0.711 0.844 0.623
 0.913 0.756 1.045 0.775 0.645 0.793 0.488 0.511 0.811 0.838 0.641 0.646
 0.972 0.559 0.659 0.525 0.038 0.871 0.919 0.179 0.639 0.077 0.564 0.419
 0.853 0.64  0.848 1.033 0.351 0.675 0.743 0.952 1.077 1.087 1.12  0.885
 0.592 0.893 0.265 1.292 0.457 0.771 0.977 0.053 1.318 0.809 0.674 0.968
 0.316 0.15  0.558 0.485 0.735 0.275 0.19  1.381 0.379 0.689 0.561 0.174
 0.217 1.174 0.766 0.683 0.    0.281 0.28  0.492 0.788 0.865 0.881 0.794
 0.712 0.658 0.891 1.24  0.911 0.946 0.2   0.465 0.489 0.541 0.86  0.628
 0.062 0.795 1.722 0.892 0.578 0.704 0.732 0.587 0.956 0.185 0.341 0.58
 0.378 1.036 0.549 0.491 0.702 0.638 0.176 0.912 0.535 0.521 0.653 0.604
 0.73  0.66  1.139 0.509 1.882 0.463 0.634 0.694 1.148 0.757 1.35  0.362
 0.822 0.755 0.395 0.861 0.738 1.133 0.872 0.886 1.156 0.532 1.03  0.453
 0.821 1.034 0.635 0.154 0.903 1.207 1.31  0.523 0.878 0.744 0.317 0.93
 0.24  0.804 0.761 0.54  0.479 0.551 1.4   0.553 0.426 0.816 0.698 0.227
 0.896 0.792 1.051 0.61  0.884 0.408 0.617 0.935 0.361 0.902 0.78  0.841
 0.796 0.975 1.081 0.707 0.422 0.964 0.172 0.805 0.717 0.347 1.138 0.791
 0.681 0.256 1.609 0.868 0.468 0.432 1.121 0.787 0.596 0.976 1.158 1.028
 0.949 0.451 0.456 0.837 1.212 0.673 0.222 0.171 0.51  0.685 0.396 0.388
 0.644 0.914 1.476 0.46  0.547 1.421 0.825 0.729 0.723 1.471 0.939 0.974
 0.943 0.84  0.627 0.13  1.147 0.327 1.065 0.705 1.37  0.854 0.951 0.569
 0.921 0.776 0.927 0.449 0.475 0.97  1.097 0.612 1.024 1.088 0.648 0.242
 0.661 0.745 1.522 0.843 0.907 1.027 1.783 0.62  0.814 1.026 0.851 1.094
 0.431 1.057 0.226 0.736 0.103 1.29  0.925 0.566 0.161 0.303 1.152 1.65
 0.74  1.194 1.226 0.642 1.323 1.025 1.074 0.508 0.49  0.534 0.83  0.978
 1.206 1.054 0.936 0.932 0.105 1.061 1.031 1.478 0.898 0.672 0.188 0.518
 0.953 1.049 1.086 0.691 0.411 1.029 1.419 1.075 0.206 0.973 1.219 1.162
 0.827 1.321 0.343 0.764 0.125 0.119 1.189 1.179 1.258 1.229 1.073 0.074
 1.458 1.172 1.32  1.108 1.16  0.36  1.391 1.583 0.147 1.115 0.359 1.128
 0.915 0.282 0.162 1.303 0.582 1.382 1.171 0.029 1.161 0.192 1.346 0.473
 0.097 0.82  0.557 0.894 1.135 1.367 1.023 0.544 0.589 0.603 0.442 0.295
 0.434 0.554 0.372 0.527 0.709 0.782 0.797 0.695 0.849 0.768 0.863 0.746
 0.597 0.631 0.678 0.887 0.754 0.687 0.699 0.873 0.716 0.934 0.847 0.244
 0.803 0.772 0.859 1.064 0.819 0.573 0.807 0.79  0.817 0.785 0.823 0.836
 0.616 0.831 1.06  1.122 0.866 0.662 0.869 0.779 0.981 0.293 0.855 0.98
 0.671 1.079 0.693 0.77  1.093 1.018 1.022 0.734 0.753 0.726 0.922 0.948
 1.684 0.918]


Column 'Avg_Utilization_Ratio' has 964 unique values:
[0.061 0.105 0.    0.76  0.311 0.066 0.048 0.113 0.144 0.217 0.174 0.195
 0.279 0.23  0.078 0.095 0.788 0.08  0.086 0.152 0.626 0.215 0.093 0.099
 0.285 0.658 0.69  0.282 0.562 0.135 0.544 0.757 0.241 0.077 0.018 0.355
 0.145 0.209 0.793 0.074 0.259 0.591 0.687 0.127 0.667 0.843 0.422 0.156
 0.525 0.587 0.211 0.088 0.111 0.044 0.276 0.704 0.656 0.053 0.051 0.467
 0.698 0.067 0.079 0.287 0.36  0.256 0.719 0.198 0.14  0.035 0.619 0.108
 0.062 0.765 0.963 0.524 0.347 0.45  0.232 0.299 0.085 0.059 0.43  0.62
 0.027 0.169 0.058 0.223 0.057 0.513 0.473 0.047 0.106 0.05  0.03  0.615
 0.15  0.407 0.191 0.096 0.176 0.83  0.412 0.678 0.246 0.271 0.114 0.395
 0.406 0.258 0.178 0.941 0.141 0.118 0.119 0.64  0.432 0.612 0.359 0.309
 0.101 0.607 0.512 0.806 0.463 0.77  0.076 0.133 0.037 0.146 0.171 0.069
 0.837 0.055 0.294 0.39  0.19  0.692 0.503 0.251 0.11  0.087 0.214 0.164
 0.049 0.043 0.679 0.098 0.694 0.039 0.199 0.22  0.13  0.202 0.319 0.165
 0.863 0.665 0.598 0.539 0.472 0.064 0.16  0.42  0.713 0.092 0.336 0.666
 0.147 0.987 0.073 0.88  0.28  0.65  0.761 0.072 0.327 0.459 0.252 0.244
(long numeric array output elided)


Feature Engineering¶

In [239]:
## Treating Income Category = abc
df.loc[df[df['Income_Category'] == 'abc'].index, 'Income_Category'] = 'Unknown'
df['Income_Category'].unique()
Out[239]:
array(['$60K - $80K', 'Less than $40K', '$80K - $120K', '$40K - $60K',
       '$120K +', 'Unknown'], dtype=object)
In [240]:
df1 = df.copy()
df1.describe(include="all").T
Out[240]:
count unique top freq mean std min 25% 50% 75% max
Attrition_Flag 10127 2 Existing Customer 8500 NaN NaN NaN NaN NaN NaN NaN
Customer_Age 10127.0 NaN NaN NaN 46.32596 8.016814 26.0 41.0 46.0 52.0 73.0
Gender 10127 2 F 5358 NaN NaN NaN NaN NaN NaN NaN
Dependent_count 10127.0 NaN NaN NaN 2.346203 1.298908 0.0 1.0 2.0 3.0 5.0
Education_Level 10127 7 Graduate 3128 NaN NaN NaN NaN NaN NaN NaN
Marital_Status 10127 4 Married 4687 NaN NaN NaN NaN NaN NaN NaN
Income_Category 10127 6 Less than $40K 3561 NaN NaN NaN NaN NaN NaN NaN
Card_Category 10127 4 Blue 9436 NaN NaN NaN NaN NaN NaN NaN
Months_on_book 10127.0 NaN NaN NaN 35.928409 7.986416 13.0 31.0 36.0 40.0 56.0
Total_Relationship_Count 10127.0 NaN NaN NaN 3.81258 1.554408 1.0 3.0 4.0 5.0 6.0
Months_Inactive_12_mon 10127.0 NaN NaN NaN 2.341167 1.010622 0.0 2.0 2.0 3.0 6.0
Contacts_Count_12_mon 10127.0 NaN NaN NaN 2.455317 1.106225 0.0 2.0 2.0 3.0 6.0
Credit_Limit 10127.0 NaN NaN NaN 8631.953698 9088.77665 1438.3 2555.0 4549.0 11067.5 34516.0
Total_Revolving_Bal 10127.0 NaN NaN NaN 1162.814061 814.987335 0.0 359.0 1276.0 1784.0 2517.0
Avg_Open_To_Buy 10127.0 NaN NaN NaN 7469.139637 9090.685324 3.0 1324.5 3474.0 9859.0 34516.0
Total_Amt_Chng_Q4_Q1 10127.0 NaN NaN NaN 0.759941 0.219207 0.0 0.631 0.736 0.859 3.397
Total_Trans_Amt 10127.0 NaN NaN NaN 4404.086304 3397.129254 510.0 2155.5 3899.0 4741.0 18484.0
Total_Trans_Ct 10127.0 NaN NaN NaN 64.858695 23.47257 10.0 45.0 67.0 81.0 139.0
Total_Ct_Chng_Q4_Q1 10127.0 NaN NaN NaN 0.712222 0.238086 0.0 0.582 0.702 0.818 3.714
Avg_Utilization_Ratio 10127.0 NaN NaN NaN 0.274894 0.275691 0.0 0.023 0.176 0.503 0.999
In [241]:
# For dropping columns
columns_to_drop = [
    "credit_limit",
    "dependent_count",
    "months_on_book",
    "avg_open_to_buy",
    "customer_age"
]


# For masking a particular value in a feature
column_to_mask_value = "income_category"
value_to_mask = "abc"
masked_value = "Unknown"

# Loss function used later during model training
loss_func = "logloss"

# Test and Validation sizes
test_size = 0.2
val_size = 0.25

# Dependent variable value map
target_mapper = {"Attrited Customer": 1, "Existing Customer": 0}
In [242]:
cat_columns = df1.select_dtypes(include="object").columns.tolist()
df1[cat_columns] = df1[cat_columns].astype("category")
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   Attrition_Flag            10127 non-null  category
 1   Customer_Age              10127 non-null  int64   
 2   Gender                    10127 non-null  category
 3   Dependent_count           10127 non-null  int64   
 4   Education_Level           10127 non-null  category
 5   Marital_Status            10127 non-null  category
 6   Income_Category           10127 non-null  category
 7   Card_Category             10127 non-null  category
 8   Months_on_book            10127 non-null  int64   
 9   Total_Relationship_Count  10127 non-null  int64   
 10  Months_Inactive_12_mon    10127 non-null  int64   
 11  Contacts_Count_12_mon     10127 non-null  int64   
 12  Credit_Limit              10127 non-null  float64 
 13  Total_Revolving_Bal       10127 non-null  int64   
 14  Avg_Open_To_Buy           10127 non-null  float64 
 15  Total_Amt_Chng_Q4_Q1      10127 non-null  float64 
 16  Total_Trans_Amt           10127 non-null  int64   
 17  Total_Trans_Ct            10127 non-null  int64   
 18  Total_Ct_Chng_Q4_Q1       10127 non-null  float64 
 19  Avg_Utilization_Ratio     10127 non-null  float64 
dtypes: category(6), float64(5), int64(9)
memory usage: 1.1 MB

Split Train, test¶

In [243]:
X = df1.drop(columns=["Attrition_Flag"])
y = df1["Attrition_Flag"].map(target_mapper)

# Splitting data into training, validation and test sets (60-20-20):
# first we split the data into two parts, temporary and test

X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# then we split the temporary set into train and validation

X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
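The shapes printed below (6075/2026/2026) follow from the two chained splits; a quick stdlib check of the arithmetic (this assumes, as sklearn does, that the held-out split is rounded up):

```python
import math

# test_size=0.2 followed by test_size=0.25 on the remainder
# yields a 60-20-20 train-validation-test split.
n = 10127
n_test = math.ceil(n * 0.2)       # first split: 20% held out for test
n_temp = n - n_test
n_val = math.ceil(n_temp * 0.25)  # second split: 25% of the remainder
n_train = n_temp - n_val
print(n_train, n_val, n_test)  # 6075 2026 2026
```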

Fit the Dataset¶

In [244]:
print(
    "Training data shape: \n\n",
    X_train.shape,
    "\n\nTesting Data Shape: \n\n",
    X_test.shape,
)
Training data shape: 

 (6075, 19) 

Testing Data Shape: 

 (2026, 19)
In [245]:
print("Training: \n", y_train.value_counts(normalize=True))
print("\n\nValidation: \n", y_val.value_counts(normalize=True))
print("\n\nTest: \n", y_test.value_counts(normalize=True))
Training: 
 Attrition_Flag
0    0.839342
1    0.160658
Name: proportion, dtype: float64


Validation: 
 Attrition_Flag
0    0.839092
1    0.160908
Name: proportion, dtype: float64


Test: 
 Attrition_Flag
0    0.839585
1    0.160415
Name: proportion, dtype: float64
In [246]:
from sklearn.compose import ColumnTransformer
from sklearn.base import TransformerMixin

# Building a function to standardize columns

def feature_name_standardize(df: pd.DataFrame):
    df_ = df.copy()
    df_.columns = [i.replace(" ", "_").lower() for i in df_.columns]
    return df_

# Building a function to drop features

def drop_feature(df: pd.DataFrame, features: list = None):
    df_ = df.copy()
    if features:  # avoids the mutable-default-argument pitfall
        df_ = df_.drop(columns=features)

    return df_

# Building a function to treat incorrect value

def mask_value(df: pd.DataFrame, feature: str = None, value_to_mask: str = None, masked_value: str = None):
    df_ = df.copy()
    if feature is not None and value_to_mask is not None and feature in df_.columns:
        df_[feature] = df_[feature].astype('object')
        df_.loc[df_[feature] == value_to_mask, feature] = masked_value
        df_[feature] = df_[feature].astype('category')

    return df_

# Building a custom imputer

def impute_category_unknown(df: pd.DataFrame, fill_value: str):
    df_ = df.copy()
    for col in df_.select_dtypes(include='category').columns.tolist():
        df_[col] = df_[col].astype('object')
        df_[col] = df_[col].fillna(fill_value)
        df_[col] = df_[col].astype('category')
    return df_

# Building a custom data preprocessing class with fit and transform methods for standardizing column names

class FeatureNamesStandardizer(TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        """All SciKit-Learn compatible transformers and classifiers have the
        same interface. `fit` always returns the same object."""
        return self

    def transform(self, X):
        """Returns dataframe with column names in lower case with underscores in place of spaces."""
        X_ = feature_name_standardize(X)
        return X_
    
    
# Building a custom data preprocessing class with fit and transform methods for dropping columns

class ColumnDropper(TransformerMixin):
    def __init__(self, features: list):
        self.features = features

    def fit(self, X, y=None):
        """All SciKit-Learn compatible transformers and classifiers have the
        same interface. `fit` always returns the same object."""
        return self

    def transform(self, X):
        """Given a list of columns, returns a dataframe without those columns."""
        X_ = drop_feature(X, features=self.features)
        return X_
        
    

# Building a custom data preprocessing class with fit and transform methods for custom value masking

class CustomValueMasker(TransformerMixin):
    def __init__(self, feature: str, value_to_mask: str, masked_value: str):
        self.feature = feature
        self.value_to_mask = value_to_mask
        self.masked_value = masked_value

    def fit(self, X, y=None):
        """All SciKit-Learn compatible transformers and classifiers have the
        same interface. `fit` always returns the same object."""
        return self

    def transform(self, X):
        """Return a dataframe with the required feature value masked as required."""
        X_ = mask_value(X, self.feature, self.value_to_mask, self.masked_value)
        return X_
    
    
# Building a custom class to one-hot encode using pandas
class PandasOneHot(TransformerMixin):
    def __init__(self, columns: list = None):
        self.columns = columns

    def fit(self, X, y=None):
        """All SciKit-Learn compatible transformers and classifiers have the
        same interface. `fit` always returns the same object."""
        return self

    def transform(self, X):
        """Return a dataframe with the given columns one-hot encoded (first level dropped)."""
        X_ = pd.get_dummies(X, columns = self.columns, drop_first=True)
        return X_
    
# Building a custom class to fill nulls with Unknown
class FillUnknown(TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        """All SciKit-Learn compatible transformers and classifiers have the
        same interface. `fit` always returns the same object."""
        return self

    def transform(self, X):
        """Return a dataframe with nulls in categorical columns filled with 'Unknown'."""
        X_ = impute_category_unknown(X, fill_value='Unknown')
        return X_
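The custom transformers above all share the same pattern: `fit` learns nothing (or learns state) and returns `self`, so `fit(...).transform(...)` chains and every step exposes a uniform interface. A minimal pure-Python sketch with a hypothetical class:

```python
# Hypothetical transformer illustrating the fit/transform contract used above.
class Lowercase:
    def fit(self, X, y=None):
        return self  # fit learns nothing here; returning self enables chaining

    def transform(self, X):
        return [s.replace(" ", "_").lower() for s in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

steps = [Lowercase()]  # further steps could be appended and applied uniformly
cols = ["Income Category", "Card Category"]
for step in steps:
    cols = step.fit_transform(cols)
print(cols)  # ['income_category', 'card_category']
```

Because every step follows this contract, the same classes could also be composed with sklearn's `Pipeline` instead of being applied one by one.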
In [247]:
# To Standardize feature names
feature_name_standardizer = FeatureNamesStandardizer()

X_train = feature_name_standardizer.fit_transform(X_train)
X_val = feature_name_standardizer.transform(X_val)
X_test = feature_name_standardizer.transform(X_test)

# To Drop unnecessary columns
column_dropper = ColumnDropper(features=columns_to_drop)

X_train = column_dropper.fit_transform(X_train)
X_val = column_dropper.transform(X_val)
X_test = column_dropper.transform(X_test)

# To Mask incorrect/meaningless value of a feature
value_masker = CustomValueMasker(
    feature=column_to_mask_value, value_to_mask=value_to_mask, masked_value=masked_value
)

X_train = value_masker.fit_transform(X_train)
X_val = value_masker.transform(X_val)
X_test = value_masker.transform(X_test)

# To impute categorical Nulls to Unknown
cat_columns = X_train.select_dtypes(include="category").columns.tolist()
imputer = FillUnknown()

X_train[cat_columns] = imputer.fit_transform(X_train[cat_columns])
X_val[cat_columns] = imputer.transform(X_val[cat_columns])
X_test[cat_columns] = imputer.transform(X_test[cat_columns])

# To encode the data
one_hot = PandasOneHot()

X_train = one_hot.fit_transform(X_train)
X_val = one_hot.transform(X_val)
X_test = one_hot.transform(X_test)


# Scale the numerical columns
robust_scaler = RobustScaler(with_centering=False, with_scaling=True)
num_columns = [
    "total_relationship_count",
    "months_inactive_12_mon",
    "contacts_count_12_mon",
    "total_revolving_bal",
    "total_amt_chng_q4_q1",
    "total_trans_amt",
    "total_trans_ct",
    "total_ct_chng_q4_q1",
    "avg_utilization_ratio",
]

X_train[num_columns] = pd.DataFrame(
    robust_scaler.fit_transform(X_train[num_columns]),
    columns=num_columns,
    index=X_train.index,
)
X_val[num_columns] = pd.DataFrame(
    robust_scaler.transform(X_val[num_columns]), columns=num_columns, index=X_val.index
)
X_test[num_columns] = pd.DataFrame(
    robust_scaler.transform(X_test[num_columns]),
    columns=num_columns,
    index=X_test.index,
)
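A sketch of what the scaler above does (assumed behaviour of `RobustScaler(with_centering=False)`): each value is divided by the column's interquartile range, learned once on the training data and reused for validation and test. Illustrative numbers only:

```python
# Hypothetical training column; q1/q3 are its 25th/75th percentiles.
train_col = [10.0, 20.0, 30.0, 40.0, 50.0]
q1, q3 = 20.0, 40.0
iqr = q3 - q1  # "learned" during fit, reused to transform val/test
scaled = [v / iqr for v in train_col]
print(scaled)  # [0.5, 1.0, 1.5, 2.0, 2.5]
```

Scaling by the IQR rather than the standard deviation makes the transform less sensitive to the long right tails seen in columns like `total_trans_amt`.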

Model Building - Original Data¶

Metrics for evaluation¶

In [248]:
def get_metrics_score(model, train, test, train_y, test_y, threshold=0.5, flag=False, roc=True):

    # defining an empty list to store train and test results
    score_list = []

    # thresholding the predicted probabilities gives 0/1 class labels directly
    pred_train = model.predict_proba(train)[:, 1] > threshold
    pred_test = model.predict_proba(test)[:, 1] > threshold

    train_acc = accuracy_score(train_y, pred_train)
    test_acc = accuracy_score(test_y, pred_test)

    train_recall = recall_score(train_y, pred_train)
    test_recall = recall_score(test_y, pred_test)

    train_precision = precision_score(train_y, pred_train)
    test_precision = precision_score(test_y, pred_test)

    train_f1 = f1_score(train_y, pred_train)
    test_f1 = f1_score(test_y, pred_test)

    pred_train_proba = model.predict_proba(train)[:, 1]
    pred_test_proba = model.predict_proba(test)[:, 1]

    train_roc_auc = roc_auc_score(train_y, pred_train_proba)
    test_roc_auc = roc_auc_score(test_y, pred_test_proba)

    score_list.extend(
        (
            train_acc,
            test_acc,
            train_recall,
            test_recall,
            train_precision,
            test_precision,
            train_f1,
            test_f1,
            train_roc_auc,
            test_roc_auc,
        )
    )

    if flag:

        print("Accuracy on training set : ", accuracy_score(train_y, pred_train))
        print("Accuracy on test set : ", accuracy_score(test_y, pred_test))
        print("Recall on training set : ", recall_score(train_y, pred_train))
        print("Recall on test set : ", recall_score(test_y, pred_test))
        print("Precision on training set : ", precision_score(train_y, pred_train))
        print("Precision on test set : ", precision_score(test_y, pred_test))
        print("F1 on training set : ", f1_score(train_y, pred_train))
        print("F1 on test set : ", f1_score(test_y, pred_test))

    if roc:
        if flag:
            print(
                "ROC-AUC Score on training set : ",
                roc_auc_score(train_y, pred_train_proba),
            )
            print(
                "ROC-AUC Score on test set : ", roc_auc_score(test_y, pred_test_proba)
            )
    return score_list 
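The `threshold` argument in `get_metrics_score` controls how predicted probabilities become 0/1 labels: a sample is flagged as attrited only when its probability strictly exceeds the threshold. A small illustration with hypothetical probabilities:

```python
# Hypothetical predicted probabilities for five customers.
probs = [0.10, 0.49, 0.50, 0.51, 0.90]

threshold = 0.5
preds = [int(p > threshold) for p in probs]
print(preds)  # [0, 0, 0, 1, 1]

# Lowering the threshold flags more customers, trading precision for recall.
preds_low = [int(p > 0.3) for p in probs]
print(preds_low)  # [0, 1, 1, 1, 1]
```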
In [249]:
# defining empty lists to add train and test results

model_names = []
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
f1_train = []
f1_test = []
roc_auc_train = []
roc_auc_test = []
cross_val_train = []


def add_score_model(model_name, score, cv_res):
    """Add scores to list so that we can compare all models score together"""
    model_names.append(model_name)
    acc_train.append(score[0])
    acc_test.append(score[1])
    recall_train.append(score[2])
    recall_test.append(score[3])
    precision_train.append(score[4])
    precision_test.append(score[5])
    f1_train.append(score[6])
    f1_test.append(score[7])
    roc_auc_train.append(score[8])
    roc_auc_test.append(score[9])
    cross_val_train.append(cv_res)
In [250]:
## for confusion matrix
def make_confusion_matrix(model, test_X, y_actual, labels=[1, 0]):
    """
    model : classifier to predict values of X
    test_X: test set
    y_actual : ground truth

    """
    y_predict = model.predict(test_X)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=[1, 0])
    df_cm = pd.DataFrame(
        cm,
        index=[i for i in ["Actual - Attrited", "Actual - Existing"]],
        columns=[i for i in ["Predicted - Attrited", "Predicted - Existing"]],
    )
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    plt.figure(figsize=(5, 3))
    sns.heatmap(df_cm, annot=labels, fmt="", cmap="Blues").set(title="Confusion Matrix")
In [251]:
print(
    "Training data shape: \n\n",
    X_train.shape,
    "\n\nValidation Data Shape: \n\n",
    X_val.shape,
    "\n\nTesting Data Shape: \n\n",
    X_test.shape,
)
Training data shape: 

 (6075, 27) 

Validation Data Shape: 

 (2026, 27) 

Testing Data Shape: 

 (2026, 27)
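The jump from 19 to 27 columns follows from the preprocessing: 5 numeric columns were dropped, and the 5 categorical columns were replaced by their `drop_first` dummies (k levels become k-1 dummy columns each). Level counts are taken from the `describe()` output above:

```python
# Column-count arithmetic for the preprocessed feature matrix.
n_levels = {"gender": 2, "education_level": 7, "marital_status": 4,
            "income_category": 6, "card_category": 4}
dummies = sum(k - 1 for k in n_levels.values())  # drop_first keeps k-1 per column
print(19 - 5 - 5 + dummies)  # 27
```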

Build 5 models¶

  • Bagging
  • Random Forest Classifier
  • Gradient Boosting
  • Decision Tree Classifier
  • Adaptive Boosting
In [252]:
models = []  # Empty list to store all the models
cv_results = []
# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1, algorithm='SAMME')))
models.append(("Decisiontree", DecisionTreeClassifier(random_state=1)))

# For each model, run stratified 10-fold cross-validation with recall as the scoring metric
for name, model in models:
    scoring = "recall"
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)  # Setting number of splits equal to 10

    cv_result = cross_val_score(estimator=model, X=X_train, y=y_train, scoring=scoring, cv=kfold)
    cv_results.append(cv_result)
    model.fit(X_train, y_train)
    model_score = get_metrics_score(model, X_train, X_val, y_train, y_val)
    add_score_model(name, model_score, cv_result.mean())

print("Added all models!")
Added all models!

Compare 5 models¶

In [253]:
comparison_frame = pd.DataFrame(
    {
        "Model": model_names,
        "Cross_Val_Score_Train": cross_val_train,
        "Train_Accuracy": acc_train,
        "Test_Accuracy": acc_test,
        "Train_Recall": recall_train,
        "Test_Recall": recall_test,
        "Train_Precision": precision_train,
        "Test_Precision": precision_test,
        "Train_F1": f1_train,
        "Test_F1": f1_test,
        "Train_ROC_AUC": roc_auc_train,
        "Test_ROC_AUC": roc_auc_test,
    }
)

# Sorting models in decreasing order of train CV score and test recall
comparison_frame.sort_values(by=["Cross_Val_Score_Train", "Test_Recall"], ascending=False).style.highlight_max(color="green", axis=0).highlight_min(color="orange", axis=0)
Out[253]:
  Model Cross_Val_Score_Train Train_Accuracy Test_Accuracy Train_Recall Test_Recall Train_Precision Test_Precision Train_F1 Test_F1 Train_ROC_AUC Test_ROC_AUC
2 GBM 0.817620 0.969712 0.969398 0.873975 0.874233 0.933260 0.931373 0.902646 0.901899 0.992689 0.989937
0 Bagging 0.785862 0.996049 0.954590 0.980533 0.822086 0.994802 0.887417 0.987616 0.853503 0.999899 0.978021
1 Random forest 0.770440 1.000000 0.959526 1.000000 0.812883 1.000000 0.926573 1.000000 0.866013 1.000000 0.983956
4 Decisiontree 0.754113 1.000000 0.937315 1.000000 0.806748 1.000000 0.804281 1.000000 0.805513 1.000000 0.884551
3 Adaboost 0.729560 0.942551 0.955084 0.753074 0.803681 0.871886 0.906574 0.808136 0.852033 0.978865 0.980718
  • The best model is Gradient Boosting; Bagging and Random Forest are the next best, in that order

Cross-validation Result¶

In [254]:
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))

fig.suptitle("Models Comparison")
ax = fig.add_subplot(111)

plt.boxplot(cv_results)
ax.set_xticklabels(model_names)

plt.show()
(Boxplots of cross-validation recall scores for each model)

Model Building - Oversampled data¶

In [255]:
print("Before OverSampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before OverSampling, counts of label 'No': {} \n".format(sum(y_train == 0)))

sm = SMOTE(
    sampling_strategy="minority", k_neighbors=10, random_state=1
)  # Synthetic Minority Over Sampling Technique

X_train_over, y_train_over = sm.fit_resample(X_train, y_train)


print("After OverSampling, counts of label 'Yes': {}".format(sum(y_train_over == 1)))
print("After OverSampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))


print("After OverSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After OverSampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before OverSampling, counts of label 'Yes': 976
Before OverSampling, counts of label 'No': 5099 

After OverSampling, counts of label 'Yes': 5099
After OverSampling, counts of label 'No': 5099 

After OverSampling, the shape of train_X: (10198, 27)
After OverSampling, the shape of train_y: (10198,) 
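A simplified sketch of SMOTE's core idea (the real algorithm picks among `k_neighbors` minority-class neighbours in the full feature space): each synthetic sample is a random interpolation between a minority point and one of its minority-class neighbours.

```python
import random

# Hypothetical 2-D minority sample and one of its minority-class neighbours.
random.seed(1)
point = [2.0, 5.0]
neighbour = [4.0, 9.0]
gap = random.random()  # uniform in [0, 1)
synthetic = [p + gap * (q - p) for p, q in zip(point, neighbour)]

# The synthetic sample lies on the segment between the two points.
print(all(p <= s <= q for s, p, q in zip(synthetic, point, neighbour)))  # True
```

Because synthetic points lie between real minority samples, SMOTE densifies the minority region rather than simply duplicating rows.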

Train¶

In [256]:
models_over = []

# Appending models into the list

models_over.append(("Bagging OverSampling", BaggingClassifier(random_state=1)))
models_over.append(("Random forest OverSampling", RandomForestClassifier(random_state=1)))
models_over.append(("GBM OverSampling", GradientBoostingClassifier(random_state=1)))
models_over.append(("Adaboost OverSampling", AdaBoostClassifier(random_state=1, algorithm='SAMME')))
models_over.append(("Decision Tree OverSampling", DecisionTreeClassifier(random_state=1)))

for name, model in models_over:
    scoring = "recall"
    kfold = StratifiedKFold(
        n_splits=10, shuffle=True, random_state=10
    )  # Setting number of splits equal to 10

    cv_result_over = cross_val_score(
        estimator=model, X=X_train_over, y=y_train_over, scoring=scoring, cv=kfold
    )
    cv_results.append(cv_result_over)

    model.fit(X_train_over, y_train_over)
    model_score_over = get_metrics_score(
        model, X_train_over, X_val, y_train_over, y_val
    )
    add_score_model(name, model_score_over, cv_result_over.mean())

print("Adding Oversampling models Completed!")
Adding Oversampling models Completed!

Compare models¶

In [257]:
comparison_frame = pd.DataFrame(
    {
        "Model": model_names,
        "Cross_Val_Score_Train": cross_val_train,
        "Train_Accuracy": acc_train,
        "Test_Accuracy": acc_test,
        "Train_Recall": recall_train,
        "Test_Recall": recall_test,
        "Train_Precision": precision_train,
        "Test_Precision": precision_test,
        "Train_F1": f1_train,
        "Test_F1": f1_test,
        "Train_ROC_AUC": roc_auc_train,
        "Test_ROC_AUC": roc_auc_test,
    }
)

# Sorting models in decreasing order of test recall
comparison_frame.sort_values(by=["Test_Recall", "Cross_Val_Score_Train"], ascending=False
).style.highlight_max(color="green", axis=0).highlight_min(color="orange", axis=0)
Out[257]:
  Model Cross_Val_Score_Train Train_Accuracy Test_Accuracy Train_Recall Test_Recall Train_Precision Test_Precision Train_F1 Test_F1 Train_ROC_AUC Test_ROC_AUC
7 GBM OverSampling 0.967445 0.970484 0.957552 0.975093 0.917178 0.966187 0.835196 0.970620 0.874269 0.995988 0.988800
8 Adaboost OverSampling 0.941948 0.934105 0.918559 0.947245 0.911043 0.922989 0.685912 0.934959 0.782609 0.982744 0.972768
6 Random forest OverSampling 0.980780 1.000000 0.956565 1.000000 0.895706 1.000000 0.843931 1.000000 0.869048 1.000000 0.985522
2 GBM 0.817620 0.969712 0.969398 0.873975 0.874233 0.933260 0.931373 0.902646 0.901899 0.992689 0.989937
5 Bagging OverSampling 0.962738 0.996960 0.943731 0.996862 0.861963 0.997058 0.802857 0.996960 0.831361 0.999969 0.973466
0 Bagging 0.785862 0.996049 0.954590 0.980533 0.822086 0.994802 0.887417 0.987616 0.853503 0.999899 0.978021
9 Decision Tree OverSampling 0.943519 1.000000 0.923001 1.000000 0.819018 1.000000 0.733516 1.000000 0.773913 1.000000 0.880980
1 Random forest 0.770440 1.000000 0.959526 1.000000 0.812883 1.000000 0.926573 1.000000 0.866013 1.000000 0.983956
4 Decisiontree 0.754113 1.000000 0.937315 1.000000 0.806748 1.000000 0.804281 1.000000 0.805513 1.000000 0.884551
3 Adaboost 0.729560 0.942551 0.955084 0.753074 0.803681 0.871886 0.906574 0.808136 0.852033 0.978865 0.980718
  • After oversampling, the GBM, AdaBoost, and Random Forest oversampled models achieve higher test recall than their counterparts trained on the original data

Model Building - Undersampled data¶

In [258]:
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
In [259]:
print("Before Under Sampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Under Sampling, counts of label 'No': {} \n".format(sum(y_train == 0)))

print("After Under Sampling, counts of label 'Yes': {}".format(sum(y_train_un == 1)))
print("After Under Sampling, counts of label 'No': {} \n".format(sum(y_train_un == 0)))

print("After Under Sampling, the shape of train_X: {}".format(X_train_un.shape))
print("After Under Sampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before Under Sampling, counts of label 'Yes': 976
Before Under Sampling, counts of label 'No': 5099 

After Under Sampling, counts of label 'Yes': 976
After Under Sampling, counts of label 'No': 976 

After Under Sampling, the shape of train_X: (1952, 27)
After Under Sampling, the shape of train_y: (1952,) 

Build Models¶

In [260]:
models_under = []

# Appending models into the list

models_under.append(("Bagging UnderSampling", BaggingClassifier(random_state=1)))
models_under.append(("Random forest UnderSampling", RandomForestClassifier(random_state=1)))
models_under.append(("GBM UnderSampling", GradientBoostingClassifier(random_state=1)))
models_under.append(("Adaboost UnderSampling", AdaBoostClassifier(random_state=1, algorithm='SAMME')))
models_under.append(("DecisionTree UnderSampling", DecisionTreeClassifier(random_state=1)))

for name, model in models_under:
    scoring = "recall"
    kfold = StratifiedKFold(
        n_splits=10, shuffle=True, random_state=1
    )  # Setting number of splits equal to 10

    cv_result_under = cross_val_score(
        estimator=model, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold
    )
    cv_results.append(cv_result_under)

    model.fit(X_train_un, y_train_un)
    model_score_under = get_metrics_score(model, X_train_un, X_val, y_train_un, y_val)
    add_score_model(name, model_score_under, cv_result_under.mean())

print("Adding Undersampling models Completed!")
Adding Undersampling models Completed!

Compare Undersampling Models¶

In [261]:
comparison_frame = pd.DataFrame(
    {
        "Model": model_names,
        "Cross_Val_Score_Train": cross_val_train,
        "Train_Accuracy": acc_train,
        "Test_Accuracy": acc_test,
        "Train_Recall": recall_train,
        "Test_Recall": recall_test,
        "Train_Precision": precision_train,
        "Test_Precision": precision_test,
        "Train_F1": f1_train,
        "Test_F1": f1_test,
        "Train_ROC_AUC": roc_auc_train,
        "Test_ROC_AUC": roc_auc_test,
    }
)

# Sorting models in decreasing order of test recall
comparison_frame.sort_values(
    by=["Test_Recall", "Cross_Val_Score_Train"], ascending=False
).style.highlight_max(color="green", axis=0).highlight_min(color="orange", axis=0)
Out[261]:
  Model Cross_Val_Score_Train Train_Accuracy Test_Accuracy Train_Recall Test_Recall Train_Precision Test_Precision Train_F1 Test_F1 Train_ROC_AUC Test_ROC_AUC
12 GBM UnderSampling 0.951799 0.967725 0.938796 0.979508 0.957055 0.956957 0.739336 0.968101 0.834225 0.995357 0.989747
11 Random forest UnderSampling 0.935388 1.000000 0.928430 1.000000 0.932515 1.000000 0.711944 1.000000 0.807437 1.000000 0.979840
10 Bagging UnderSampling 0.920029 0.994365 0.924482 0.990779 0.932515 0.997936 0.698851 0.994344 0.798949 0.999701 0.972970
13 Adaboost UnderSampling 0.917968 0.928791 0.918559 0.933402 0.926380 0.924873 0.681716 0.929118 0.785436 0.979916 0.979626
7 GBM OverSampling 0.967445 0.970484 0.957552 0.975093 0.917178 0.966187 0.835196 0.970620 0.874269 0.995988 0.988800
8 Adaboost OverSampling 0.941948 0.934105 0.918559 0.947245 0.911043 0.922989 0.685912 0.934959 0.782609 0.982744 0.972768
6 Random forest OverSampling 0.980780 1.000000 0.956565 1.000000 0.895706 1.000000 0.843931 1.000000 0.869048 1.000000 0.985522
14 DecisionTree UnderSampling 0.896423 1.000000 0.891412 1.000000 0.886503 1.000000 0.612288 1.000000 0.724311 1.000000 0.889428
2 GBM 0.817620 0.969712 0.969398 0.873975 0.874233 0.933260 0.931373 0.902646 0.901899 0.992689 0.989937
5 Bagging OverSampling 0.962738 0.996960 0.943731 0.996862 0.861963 0.997058 0.802857 0.996960 0.831361 0.999969 0.973466
0 Bagging 0.785862 0.996049 0.954590 0.980533 0.822086 0.994802 0.887417 0.987616 0.853503 0.999899 0.978021
9 Decision Tree OverSampling 0.943519 1.000000 0.923001 1.000000 0.819018 1.000000 0.733516 1.000000 0.773913 1.000000 0.880980
1 Random forest 0.770440 1.000000 0.959526 1.000000 0.812883 1.000000 0.926573 1.000000 0.866013 1.000000 0.983956
4 Decisiontree 0.754113 1.000000 0.937315 1.000000 0.806748 1.000000 0.804281 1.000000 0.805513 1.000000 0.884551
3 Adaboost 0.729560 0.942551 0.955084 0.753074 0.803681 0.871886 0.906574 0.808136 0.852033 0.978865 0.980718
  • After undersampling, the GBM, Random Forest, and Bagging undersampled models outperform all other models on test recall
  • The best 3 models are:
    • Gradient Boosting UnderSampling
    • Random Forest UnderSampling
    • Bagging UnderSampling

Model Performance Improvement using Hyperparameter Tuning¶

Choice of 3 models that can be tuned¶

  • Gradient Boosting OverSampling: worth tuning to potentially improve already strong performance metrics.
  • Adaboost UnderSampling: needs tuning since its accuracy is the lowest among all models.
  • Random Forest UnderSampling: needs tuning due to likely overfitting (perfect scores on the training set).

These models are selected based on their performance metrics and potential for improvement through tuning. This helps to optimize performance while avoiding overfitting and improving generalization to new data.

Tuning - Gradient Boosting Oversampling¶

In [262]:
# defining model - Gradient Boosting Oversampling
model = GradientBoostingClassifier(random_state=1)

# Parameter grid
param_grid = {
    'n_estimators': [50, 100, 500],
    'learning_rate': [0.05, 0.1, 0.2],
    'max_depth': [3, 5, 10],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'subsample': [0.8, 1.0],
    'max_features': ['sqrt', 'log2']
}

# Scoring metric (recall) — required by RandomizedSearchCV below
scorer = metrics.make_scorer(recall_score)

# RandomizedSearchCV setup
gbm_tuned = RandomizedSearchCV(estimator=model, param_distributions=param_grid, 
                               n_iter=5, scoring=scorer, cv=3, random_state=1, n_jobs=1)

# Fit the RandomizedSearchCV
gbm_tuned.fit(X_train_over, y_train_over)

# Output the best parameters and the best score
print("Best parameters found: ", gbm_tuned.best_params_)
print("Best CV recall score: ", gbm_tuned.best_score_)
Best parameters found:  {'subsample': 1.0, 'n_estimators': 500, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 10, 'learning_rate': 0.05}
Best CV recall score:  0.9786257198582788
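Since `RandomizedSearchCV` refits the best configuration on the full training data by default (`refit=True`), the tuned model is also available directly as `gbm_tuned.best_estimator_` — equivalent to re-creating the classifier from `best_params_` as done in the next cell. A minimal illustration on toy data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, random_state=1)
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=1),
    param_distributions={"n_estimators": [10, 20]},
    n_iter=2, cv=3, random_state=1,
)
search.fit(X, y)

# With refit=True (the default), best_estimator_ is already fitted
best = search.best_estimator_
print(best.n_estimators == search.best_params_["n_estimators"])  # → True
```

Using `best_estimator_` avoids a second fit; the manual re-creation below is equally valid and makes the chosen hyperparameters explicit.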
In [263]:
# Create a new GradientBoostingClassifier with the best parameters
best_gbm = GradientBoostingClassifier(**gbm_tuned.best_params_, random_state=1)
# Fit the model on training data
best_gbm.fit(X_train_over, y_train_over)
Out[263]:
GradientBoostingClassifier(learning_rate=0.05, max_depth=10,
                           max_features='sqrt', min_samples_leaf=2,
                           min_samples_split=5, n_estimators=500,
                           random_state=1)
In [264]:
gbm_tuned_model_score = get_metrics_score(best_gbm, X_train, X_val, y_train, y_val)

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

scoring = "recall"
gbm_over_cv = cross_val_score(estimator=best_gbm, X=X_train_over, y=y_train_over, scoring=scoring, cv=kfold)
add_score_model("Tuned GBM Over Sampling", gbm_tuned_model_score, gbm_over_cv.mean())
make_confusion_matrix(best_gbm, X_val, y_val)
[Figure: confusion matrix for the tuned GBM (oversampling) model on the validation set]

Tuning - AdaBoost Undersampling¶

In [265]:
# Define the base model (AdaBoostClassifier)
model = AdaBoostClassifier(random_state=1, algorithm='SAMME')

# Parameter grid for AdaBoost (smaller ranges for faster tuning)
param_grid = {
    'n_estimators': [50, 100, 500],  # Reduced number of estimators
    'learning_rate': [0.01, 0.1, 0.5, 1.0],  # Focus on a few key values for learning rate
}

# Scoring metric (recall)
scorer = metrics.make_scorer(recall_score)

# RandomizedSearchCV setup (reduced n_iter and lower cv)
adaboost_tuned = RandomizedSearchCV(estimator=model, param_distributions=param_grid, 
                                    n_iter=10, scoring=scorer, cv=3, random_state=1, n_jobs=1)

# Fit the RandomizedSearchCV
adaboost_tuned.fit(X_train_un, y_train_un)

# Output the best parameters and the best score
print("Best parameters found: ", adaboost_tuned.best_params_)
print("Best CV recall score: ", adaboost_tuned.best_score_)
Best parameters found:  {'n_estimators': 500, 'learning_rate': 1.0}
Best CV recall score:  0.9364826175869121
In [266]:
# Create a new AdaBoostClassifier with the best parameters
best_ada = AdaBoostClassifier(**adaboost_tuned.best_params_, random_state=1, algorithm='SAMME')
# Fit the model on training data
best_ada.fit(X_train_un, y_train_un)
Out[266]:
AdaBoostClassifier(algorithm='SAMME', n_estimators=500, random_state=1)
In [267]:
ada_tuned_model_score = get_metrics_score(best_ada, X_train, X_val, y_train, y_val)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scoring = "recall"
ada_down_cv = cross_val_score(estimator=best_ada, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold)
add_score_model("Tuned AdaBoost Under Sampling", ada_tuned_model_score, ada_down_cv.mean())
make_confusion_matrix(best_ada, X_val, y_val)
[Figure: confusion matrix for the tuned AdaBoost (undersampling) model on the validation set]

Tuning - Random Forest UnderSampling¶

In [268]:
# Define the model
model = RandomForestClassifier(random_state=1)

# Parameter grid
param_grid = {
    'n_estimators': [50, 100, 150],  # Limit the range for speed
    'max_depth': [10, 20, 30, None],  # Include None for unlimited depth
    'min_samples_split': [2, 5, 10],  # Focus on reasonable values
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['log2', 'sqrt'],  # Common options
    'bootstrap': [True, False]  # Bootstrap sampling (whether or not to sample with replacement)
}

# Scoring metric (recall)
scorer = metrics.make_scorer(recall_score)

# RandomizedSearchCV setup (reduced n_iter and lower cv)
rf_tuned = RandomizedSearchCV(estimator=model, param_distributions=param_grid, 
                              n_iter=10, scoring=scorer, cv=3, random_state=1, n_jobs=1)

# Fit the RandomizedSearchCV
rf_tuned.fit(X_train_un, y_train_un)

# Output the best parameters and the best score
print("Best parameters found: ", rf_tuned.best_params_)
print("Best CV recall score: ", rf_tuned.best_score_)
Best parameters found:  {'n_estimators': 150, 'min_samples_split': 10, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 30, 'bootstrap': False}
Best CV recall score:  0.9364920560012585
In [269]:
# Create a new RandomForestClassifier with the best parameters
best_rf = RandomForestClassifier(**rf_tuned.best_params_, random_state=1)
# Fit the model on training data
best_rf.fit(X_train_un, y_train_un)
Out[269]:
RandomForestClassifier(bootstrap=False, max_depth=30, min_samples_split=10,
                       n_estimators=150, random_state=1)
In [270]:
rf_tuned_model_score = get_metrics_score(best_rf, X_train, X_val, y_train, y_val)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scoring = "recall"
rf_cv = cross_val_score(estimator=best_rf, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold)
add_score_model("Tuned RandomForest Under Sampling", rf_tuned_model_score, rf_cv.mean())
make_confusion_matrix(best_rf, X_val, y_val)
[Figure: confusion matrix for the tuned Random Forest (undersampling) model on the validation set]
In [271]:
comparison_frame = pd.DataFrame(
    {
        "Model": model_names,
        "Cross_Val_Score_Train": cross_val_train,
        "Train_Accuracy": acc_train,
        "Test_Accuracy": acc_test,
        "Train_Recall": recall_train,
        "Test_Recall": recall_test,
        "Train_Precision": precision_train,
        "Test_Precision": precision_test,
        "Train_F1": f1_train,
        "Test_F1": f1_test,
        "Train_ROC_AUC": roc_auc_train,
        "Test_ROC_AUC": roc_auc_test,
    }
)


comparison_frame.tail(3).sort_values(
    by=["Cross_Val_Score_Train", "Test_Recall"], ascending=False
)
Out[271]:
Model Cross_Val_Score_Train Train_Accuracy Test_Accuracy Train_Recall Test_Recall Train_Precision Test_Precision Train_F1 Test_F1 Train_ROC_AUC Test_ROC_AUC
15 Tuned GBM Over Sampling 0.985096 1.000000 0.964462 1.000000 0.889571 1.000000 0.889571 1.000000 0.889571 1.000000 0.990971
17 Tuned RandomForest Under Sampling 0.944635 0.941564 0.931392 1.000000 0.947853 0.733283 0.716937 0.846121 0.816380 0.995619 0.981972
16 Tuned AdaBoost Under Sampling 0.938470 0.935473 0.936328 0.957992 0.953988 0.727061 0.731765 0.826702 0.828229 0.987684 0.989175

Performance of tuned models:¶

Tuned GBM Over Sampling:¶
  • High Training and Test Scores: The model achieves perfect accuracy and recall on the training data and very high accuracy and recall on the test data, indicating strong performance on both datasets.
  • Precision Trade-Off: Test precision (≈0.89) is noticeably lower than the perfect training precision, suggesting the model overfits the oversampled training data to some extent.
  • ROC AUC: The ROC AUC scores are excellent, showing that the model has a high ability to distinguish between the classes.
Tuned RandomForest Under Sampling:¶
  • High Recall on Training Set: The model achieves perfect recall on the training set but has slightly lower recall on the test set. This suggests the model is very good at identifying positive cases during training but slightly less effective on unseen data.
  • Precision and F1 Scores: The precision and F1 scores are lower compared to the GBM model, indicating a trade-off between precision and recall.
  • ROC AUC: Very high ROC AUC scores on both training and test data, indicating the model's strong discriminative power.
Tuned AdaBoost Under Sampling¶
  • Balanced Performance: The AdaBoost model shows relatively consistent performance across training and test datasets, with close accuracy, recall, precision, and F1 scores.
  • High ROC AUC: The ROC AUC scores are also high, demonstrating strong performance in distinguishing between classes.
  • Moderate Precision and Recall: The model has a good balance of precision and recall, which might be preferred depending on the application’s requirements.
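The balance between precision and recall noted above is summarized by the F1 score, their harmonic mean; the reported figures can be verified directly, e.g. for the tuned AdaBoost test scores from the comparison table:

```python
def f1(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# Tuned AdaBoost Under Sampling, test set (values from the table above)
print(round(f1(0.731765, 0.953988), 6))  # → 0.828229
```

Because the harmonic mean is dominated by the smaller of the two values, a high F1 requires both precision and recall to be reasonably high — which is why the AdaBoost model's lower precision caps its F1 despite its excellent recall.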

Model Performance Comparison and Final Model Selection¶

  • Among the tuned models, Gradient Boosting on oversampled data achieves the highest accuracy and cross-validation score, so it is selected as the final model.
In [272]:
# Plot feature importances of the final model
feature_names = X_train.columns
importances = best_gbm.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="green", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
[Figure: feature importances of the tuned GBM model]
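Impurity-based importances (`feature_importances_`) can be biased toward high-cardinality features, so permutation importance on held-out data is a useful cross-check. A self-contained sketch on synthetic data (in practice the notebook's own `best_gbm`, `X_val`, `y_val` would be used):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data
X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

clf = GradientBoostingClassifier(random_state=1).fit(X_tr, y_tr)

# Shuffle each feature on held-out data and measure the drop in recall
result = permutation_importance(clf, X_te, y_te, scoring="recall",
                                n_repeats=10, random_state=1)
ranking = np.argsort(result.importances_mean)[::-1]
print(ranking[:2])  # indices of the two most important features
```

If the permutation ranking broadly agrees with `feature_importances_`, that strengthens confidence in the transaction-activity features dominating the model.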
In [273]:
gbm_tuned_model_test_score = get_metrics_score(
    best_gbm, X_train, X_test, y_train, y_test
)

final_model_names = ["GBM Tuned Over-sampled Trained"]
final_acc_train = [gbm_tuned_model_test_score[0]]
final_acc_test = [gbm_tuned_model_test_score[1]]
final_recall_train = [gbm_tuned_model_test_score[2]]
final_recall_test = [gbm_tuned_model_test_score[3]]
final_precision_train = [gbm_tuned_model_test_score[4]]
final_precision_test = [gbm_tuned_model_test_score[5]]
final_f1_train = [gbm_tuned_model_test_score[6]]
final_f1_test = [gbm_tuned_model_test_score[7]]
final_roc_auc_train = [gbm_tuned_model_test_score[8]]
final_roc_auc_test = [gbm_tuned_model_test_score[9]]

final_result_score = pd.DataFrame(
    {
        "Model": final_model_names,
        "Train_Accuracy": final_acc_train,
        "Test_Accuracy": final_acc_test,
        "Train_Recall": final_recall_train,
        "Test_Recall": final_recall_test,
        "Train_Precision": final_precision_train,
        "Test_Precision": final_precision_test,
        "Train_F1": final_f1_train,
        "Test_F1": final_f1_test,
        "Train_ROC_AUC": final_roc_auc_train,
        "Test_ROC_AUC": final_roc_auc_test,
    }
)


for col in final_result_score.select_dtypes(include="float64").columns.tolist():
    final_result_score[col] = final_result_score[col] * 100


final_result_score
Out[273]:
Model Train_Accuracy Test_Accuracy Train_Recall Test_Recall Train_Precision Test_Precision Train_F1 Test_F1 Train_ROC_AUC Test_ROC_AUC
0 GBM Tuned Over-sampled Trained 100.0 97.137216 100.0 92.923077 100.0 89.614243 100.0 91.238671 100.0 99.324922
  • Performance on the unseen test data is very good: ~97% accuracy, ~93% recall, and ~99% ROC AUC
In [274]:
make_confusion_matrix(best_gbm, X_test, y_test)
[Figure: confusion matrix for the final GBM model on the test set]
In [275]:
y_pred_prob = best_gbm.predict_proba(X_test)[:, 1]  # Probability estimates for the positive class

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)
roc_auc = metrics.auc(fpr, tpr)

plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
[Figure: ROC curve for the final GBM model on the test set (AUC = 0.99)]
  • AUC is 0.99, which indicates that the GBM Oversampling Tuned model has excellent performance. A value close to 1 means the model is almost perfect at distinguishing between positive and negative classes.
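Since churn recall is the business priority, the default 0.5 probability threshold need not be optimal; the `fpr`/`tpr`/`thresholds` arrays above can be reused to pick a threshold, for instance one maximizing Youden's J statistic (TPR − FPR). A sketch with toy labels and scores standing in for `y_test` / `y_pred_prob`:

```python
import numpy as np
from sklearn import metrics

# Toy labels and predicted probabilities (stand-ins for y_test / y_pred_prob)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.8, 0.45, 0.9, 0.05])

fpr, tpr, thresholds = metrics.roc_curve(y_true, y_prob)
j = tpr - fpr                      # Youden's J statistic at each threshold
best_thr = thresholds[np.argmax(j)]
print(best_thr)  # → 0.4
```

Lowering the threshold below 0.5 trades some precision for higher recall, which may be acceptable when the cost of missing a churner exceeds the cost of a retention offer to a loyal customer.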

Actionable Insights and Recommendations¶

Top Features:¶
  • total_trans_amt (Total Transaction Amount) and total_trans_ct (Total Transaction Count):

Customers who spend and transact more frequently are likely important for business outcomes (e.g., loyal customers or high-value customers). Recommendation: Focus on these customers with rewards programs, personalized offers, or loyalty initiatives to retain and engage them further.

  • total_revolving_bal (Total Revolving Balance):

Customers with a higher revolving balance could indicate frequent credit usage or a reliance on credit. Recommendation: Provide these customers with financial advice, credit management tools, or promotional interest rate offers to reduce their balance.

  • total_ct_chng_q4_q1 (Change in Transaction Count from Q4 to Q1) and total_relationship_count:

A large change in transaction counts between quarters might indicate seasonal behavior or changes in customer needs. Customers with multiple product relationships (savings, loans, etc.) are likely to be more engaged. Recommendation: Tailor marketing efforts based on seasonal trends, and offer cross-product promotions to further deepen customer relationships.

  • avg_utilization_ratio (Average Utilization Ratio):

High utilization ratios may signal riskier financial behavior or higher credit dependency. Recommendation: Offer these customers debt counseling or credit limit adjustments to improve their financial health.

  • months_inactive_12_mon (Months Inactive in Last 12 Months):

Customers who have been inactive for several months are at risk of churn. Recommendation: Re-engage these customers with targeted offers, reminders, or personalized services to bring them back into active usage.

  • contacts_count_12_mon (Number of Contacts in the Last 12 Months):

More contact with the customer (e.g., customer service interactions) may indicate either dissatisfaction or a strong relationship. Recommendation: Analyze the nature of these interactions to identify potential pain points or opportunities for enhancing customer support.

Lower Importance Features:¶
  • Income Category, Education Level, Marital Status:

These demographic features have much lower importance compared to transaction and account-specific metrics. Actionable Insights: While demographics are useful for segmentation, focus more on behavioral and account-specific features for predictive purposes.

By focusing on transaction activity, account balances, and relationship depth, the business can target specific customer groups for retention, growth, and engagement.

  • Churn Prevention: Focus retention efforts on customers who show declining transaction activity or months of inactivity. Proactive engagement could help prevent churn.
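Operationally, the final model can rank current customers by predicted churn probability so that retention offers target the highest-risk accounts first. A hedged sketch on synthetic data (in practice `best_gbm` and the bank's prepared feature matrix with real `CLIENTNUM` ids would be used):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the prepared customer feature matrix
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
model = GradientBoostingClassifier(random_state=1).fit(X, y)

# Score every customer: probability of the positive (attrited) class
scores = pd.DataFrame({
    "CLIENTNUM": np.arange(len(X)),              # stand-in client ids
    "churn_probability": model.predict_proba(X)[:, 1],
})

# Flag the top decile of churn risk for proactive retention outreach
top_n = len(scores) // 10
at_risk = scores.nlargest(top_n, "churn_probability")
print(len(at_risk))  # → 50
```

The ranked list makes retention spend measurable: outreach conversion among the flagged decile can be compared against a held-out control group.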